Session 10: Conclusion

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2025-07-18

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Conclusion

1. Introduction

1. Introduction: What we learned about

  • Using Quarto and RStudio projects in the course
  • Packages and functions in R
  • How to use the R help docs and other ways to learn more
  • Functions, Data, Loops and If in R
  • Tidyverse vs. base R and the pipe |>
  • Literate programming
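As a quick refresher on the native pipe |> (available since R 4.1), a minimal sketch; the numbers are made up for illustration:

```r
values <- c(4, 9, 16, 25)

# nested base R call, read inside-out
round(mean(sqrt(values)), 1)

# the same computation with the native pipe, read left to right
result <- values |>
  sqrt() |>
  mean() |>
  round(1)
result
```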

2. Data Structures and Wrangling

2. Data Structures and Wrangling: What we learned about

  • how data plays into the research process
  • the difference between content and structure of data
  • the basic data structures in R and what they are good for
  • how to turn information into data
  • the key role of tables
  • and how to turn bad data structures into good tables
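The kind of reshaping we practised can be sketched with tidyr; the table and its values are made up:

```r
library(tidyr)

# a "bad" wide table: years hidden in the column names
wide <- data.frame(
  country = c("A", "B"),
  `2020`  = c(1.1, 2.2),
  `2021`  = c(1.3, 2.4),
  check.names = FALSE
)

# one row per country-year observation makes a good table
long <- pivot_longer(
  wide,
  cols = c(`2020`, `2021`),
  names_to = "year",
  values_to = "value"
)
long
```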


3. Working with Files

3. Working with Files: What we learned about

  • how to use files efficiently and how to solve problems using files
  • good practices for transparent and efficient file usage
  • how to work with many files at the same time
  • and how you can facilitate collaborative working with files
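A minimal sketch of working with many files at once; the file names and contents are made up, and everything happens in a temporary directory:

```r
# write a few example CSV files into a temporary directory
dir <- tempfile("many_files_")
dir.create(dir)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = i * 10),
            file.path(dir, paste0("part_", i, ".csv")),
            row.names = FALSE)
}

# list all CSV files, read each one, and combine them into one table
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))
combined
```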


4. Linking and joining data & SQL

4. Linking and joining data & SQL: What we learned about

  • why and how to work with relational data
  • how to join data from different tables in R
  • how to join data from different tables in SQL
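A minimal join sketch with dplyr; both tables are made up, and the comment shows the equivalent SQL query:

```r
library(dplyr)

# two hypothetical relational tables sharing the key `id`
persons <- data.frame(id = 1:3, name = c("Ada", "Ben", "Cem"))
scores  <- data.frame(id = c(1, 3), score = c(90, 75))

# keep all rows of `persons`, fill NA where no match exists
joined <- left_join(persons, scores, by = "id")
joined

# the SQL equivalent:
# SELECT * FROM persons LEFT JOIN scores USING (id);
```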


5. Scaling, Reporting and Database Software

5. Scaling, Reporting and Database Software: What we learned about

  • Repetition: DBMS
  • Working with PostgreSQL
  • Working with text databases
  • Benchmarking
  • Final scaling tips
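Benchmarking can be sketched with the bench package; the two implementations compared here are purely illustrative:

```r
library(bench)

x <- runif(1e4)

# compare a loop against the vectorised equivalent
res <- bench::mark(
  loop = {
    s <- 0
    for (v in x) s <- s + v
    s
  },
  vectorised = sum(x)
)
res[, c("expression", "median")]
```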


6. Introduction to the Web

6. Introduction to the Web: What we learned about

In this session, we learned how to scout data in the wild. We:

  • discussed web scraping from a theoretical point of view:
    • What is web scraping?
    • Why should you learn it?
    • What legal and ethical implications should you keep in mind?
  • learned a bit more about how the Internet works:
    • What is HTML?
    • What is CSS?
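How HTML and CSS selectors fit together can be sketched with rvest on a tiny hand-written snippet; no real website is involved:

```r
library(rvest)

# a minimal HTML document to illustrate tags, classes, and selectors
html <- minimal_html('
  <div class="article">
    <h1>Headline</h1>
    <p class="teaser">First paragraph.</p>
  </div>
')

# CSS selectors address elements by tag name or class
html |> html_element("h1") |> html_text2()
html |> html_element(".teaser") |> html_text2()
```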


7. Static Web Pages

7. Static Web Pages: What we learned about

In this session, we trapped some docile data that wants to be found. We:

  • went over some parsing examples:
    • Wikipedia: World Happiness Report
  • discussed some examples of good approaches to data wrangling
  • went into a bit more detail on requesting raw data
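The table-parsing workflow can be sketched offline; in the session we used the Wikipedia page of the World Happiness Report, but here a small made-up inline table stands in so the example runs without a network connection:

```r
library(rvest)

page <- minimal_html('
  <table>
    <tr><th>country</th><th>score</th></tr>
    <tr><td>Finland</td><td>7.8</td></tr>
    <tr><td>Denmark</td><td>7.6</td></tr>
  </table>
')

# html_table() turns an HTML <table> into a data frame
happiness <- page |>
  html_element("table") |>
  html_table()
happiness
```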


8. Application Programming Interface (APIs)

8. Application Programming Interface (APIs): What we learned about

In this session, we learned how to adopt data from someone else. We:

  • Learned what an API is and what parts it consists of
  • Learned about httr2, a modern and intuitive package to communicate with APIs
  • Discussed some examples:
    • A simple first API: The Guardian API
    • UK Parliament API
    • Semantic Scholar API
  • Went into a bit more detail on requesting raw data
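A minimal httr2 sketch of a Guardian request; the endpoint and parameter names follow the Guardian's Open Platform, the API key is a placeholder, and the request is only built, not sent:

```r
library(httr2)

req <- request("https://content.guardianapis.com/search") |>
  req_url_query(
    q = "web scraping",
    `api-key` = "YOUR-API-KEY"  # placeholder: register for a free key
  )

# inspect the request without sending it; req_perform(req) would send it
req
```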


9. Interactive Web Pages

9. Interactive Web Pages: What we learned about

In this session, we learned how to hunt down wild data. We:

  • Learned how to find secret APIs
  • Emulated a browser
  • Focused specifically on step 1 of the scraping workflow
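Once a hidden JSON endpoint is spotted in the browser's network tab, it can usually be queried directly. A sketch with httr2; the URL is a placeholder, and the request is only constructed, not performed:

```r
library(httr2)

req <- request("https://example.com/api/v1/items") |>
  req_url_query(page = 1) |>
  req_headers(`User-Agent` = "Mozilla/5.0")  # emulate a browser header

# req_perform(req) would send it; resp_body_json() would parse the payload
req
```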


Some more things…

Using ‘AI’ web scrapers

Essentially two kinds of ‘AI’ web scrapers:

‘AI’ parsers

  • download the HTML of a website
  • prompt the ‘AI’ to extract certain information and return it in a structured format

advantages:

  • no scraping skills needed whatsoever
  • can deal with complicated structures

disadvantages:

  • expensive (compute, time, but also 💸)
  • does not scale well
  • limited by the length of the HTML content (depends on the model’s context window)
  • potential for hallucination

verdict:

👉 don’t believe the hype, skip this one

Using ‘AI’ web scrapers

Essentially two kinds of ‘AI’ web scrapers:

‘AI’-written parsers

  • download the HTML of a website
  • prompt the ‘AI’ to extract appropriate CSS selectors / write R code

advantages:

  • only some scraping skills needed
  • can deal with complicated structures
  • scales well

disadvantages:

  • potential for hallucination (and inaccuracies)
  • limited by the length of the HTML content (depends on the model’s context window)

verdict:

👉 try to use them (but with tests and caution!)
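One way to put “tests and caution” into practice: wrap the AI-suggested selector in a function and check it against a snippet with a known answer. The selector and values here are made up:

```r
library(rvest)

# hypothetical AI-suggested selector, wrapped in a function
parse_price <- function(doc) {
  doc |>
    html_element(".price") |>
    html_text2() |>
    as.numeric()
}

# a snippet with a known answer acts as a regression test for the selector
snippet <- minimal_html('<span class="price">19.99</span>')
stopifnot(parse_price(snippet) == 19.99)
```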


Creating literate programming reports from R

  • instead of keeping code in one file and documentation/description in another, literate programming combines both in one document
  • some popular systems: Jupyter Notebook, Quarto/R Markdown
  • when rendered, code is executed from top to bottom, making sure the data collection/analysis/communication is reproducible
  • by keeping text and code together, you can make sure both stay up to date
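A minimal Quarto document illustrating the idea; the title and code content are made up:

````markdown
---
title: "My scraping report"
format: html
---

Text and code live in the same document:

```{r}
head(mtcars, 3)
```
````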

My personal template:

#| eval: false
# install my personal template package from GitHub
remotes::install_github("JBGruber/jbgtemplates")
# create a report skeleton from the template
jbgtemplates::report_template("ESS_exam")

paperboy: my webscraping framework (for news)